Introduction

Aim: To perform differential expression analysis in COPD experiments, contransting COPD vs Control group

We previously had run the following scripts:

  • 0: Data selection /0-Data-selection.RMD

  • 1: Dowload RAW data /1-download_raw-data.RMD

  • 2: Normalizing data /2-normalizing-data.RMD

  • 3: Curating data /3-curatig-data.RMD

Input

This script needs the following files, which were obtained by step 3 /3-curatig-data.RMD:

Table: GSE summary 2020-10-05_GSE_Summary.csv
Data 2: ExpressionSet objects with normalized and curated data 2020-10-05_GSE_LungTissue-CURATED.RDS

Setup

For running the script, type:

ssh -X aaltamirano@dna.lavis.unam.mx
qrsh
cd /mnt/Genoma/amedina/DataDNA/R-projects/Meta-analysis_COPD/output_data/.out
 module load r/4.0.1 

## OR
ssh ana@10.200.0.42
cd /home/ana/DataDNA/R-projects/Meta-analysis_COPD/output_data/.out

nohup R -e "rmarkdown::render(here::here('vignettes/4-DE.RMD'))" > 4-DE.RMD.out &

The script can be found in: /home/ana/DataDNA/R-projects/Meta-analysis_COPD/vignettes. And the directories were set using R/setup.R (e.i. DATA, OUPUT, DOWNLOAD) as a small functions that will paste the names into complete paths.

knitr::opts_knit$set(root.dir = here::here())
source(here::here("R/setup.R"))

And this analysis is run in: /home/ana/DataDNA/R-projects/Meta-analysis_COPD

Libraries

library(tidyverse)
library(knitr)
library(DESeq2)
library(limma)
library(oligo)
library(vsn)

Experiments

We selected experiments from lung samples of COPD patients and that also have a control group to compare. These experiments are described in the following table:

gse_table <- read.csv(OUTPUT("2020-10-07_Step3_Summary.csv"), row.names = 1)
kable(gse_table, caption = "GSE information") %>%
  kableExtra::scroll_box(width = "100%", height = "100px")
GSE information
CONTROL COPD COUNTRY SUBMISSION_DATE PLATFORM NUM_GENES AGE_MEAN AGE_MEDIAN AGE_RANGE SEX_F SEX_M SMOKING_STATUS PACKYEAR
GSE27597 16 48 USA Mar 01 2011 GPL5175 17325 59.62500 60.0 55-63 24 40 NA 30.00
GSE106986 5 14 Germany Nov 16 2017 GPL13497 21690 66.84211 69.0 44-79 7 12 NA NA
GSE37768 20 18 Spain May 04 2012 GPL570 23521 NA NA NA NA NA NA NA
GSE57148 91 96 South Korea Apr 28 2014 GPL11154 25288 NA NA NA NA NA NA NA
GSE47460 NA 145 USA May 29 2013 GPL14550 15181 63.48019 65.0 23-87 NA NA 1-Current, 2-Ever (>100), 3-Never, 17, 291, 103 NA
GSE38974 9 23 USA Jun 27 2012 GPL4133 19750 61.26087 62.0 44-79 NA NA SMOKER, 32 40.00
GSE8581 19 16 USA Jul 12 2007,Jul 13 2007,Jul 17 2007,Jul 20 2007,Jul 24 2007,Jul 25 2007 GPL570 23521 65.77193 67.0 39-84 NA NA NA
GSE1650 12 18 USA Aug 06 2004 GPL96 13516 NA NA NA NA NA NA NA
GSE1122 5 10 USA Mar 09 2004 GPL80 5615 NA NA NA NA NA NA NA
GSE119040 4 6 Argentina Aug 24 2018 GPL96 13516 NA NA NA NA NA NA NA
GSE103174 16 37 Spain Aug 28 2017 GPL13667 15021 66.15094 67.0 48-80 NA NA no, yes, 24, 29 40.25
GSE124180 15 6 USA Dec 20 2018 GPL16791 52393 62.32857 61.3 45.3-73.8 NA NA Current, Former, Never, 7, 10, 4 NA

Local functions

DE: Differential expression analysis

Using this funtion, we get a table with differential expression gene results using limma package for fitting a linear model to get genes deferentially expressed between a “Control” and a “COPD” group.

Input: GSE ID, optional: colCOPD is the column name in which the information of disease status can be found, coeff will show results of contrast with coeffitient found in possiton 2
Output: Table of differential expression results with all genes

DE <- function(ExpressionSet,colCOPD="DISEASE",coeff= "DISEASECOPD"){
     # it creates the design matrix and performs limma
     fit <- lmFit(ExpressionSet, model.matrix(as.formula(paste("~ 1 +", colCOPD)),
                                              data = pData(ExpressionSet)))
     # eBayes in lmFit model
     ebf <- eBayes(fit)
     print(colnames(coef(fit)))
     # It gets the genes with the p-values
     volcanoplot(ebf,coef = coeff,highlight=20, pch=20)
     res <- topTable(ebf, number = Inf, p.value = 1, coef = coeff,confint=T)
     # It formats in a tibble
     res <- as_tibble(res,rownames="rownames")
}

For microarray experiments, all the following steps are the same, so we will wrap the following code into one function.

Input: Position of the GSE id inside of geo list (e.i. 1,2..)
Output: Table with DE results and a .csv genes

save_DE <- function(i){
  
g <- geo[[i]][[1]]
message("plotting data")
boxplot(g)
hist(g)

# DE using a local function. DISEASE is a column with disease status information
message("DE analysis")
de <- DE(g,colCOPD = "DISEASE")

# renaming colnames with the GSE id
message("Renaming colnames with GSE id")
colnames(de) <- str_c(colnames(de),"_",names(geo)[i])
colnames(de)

# saving the results in a csv
message("Saving data into csv")
write_csv(de,path= OUTPUT(c("DE/",TODAY,"_Step4_TableGenes_",i,"_",names(geo[i]),".csv")))

# saving the results to later combine all the information 
 return(de)
}

Analysis

DATA

geo <- readRDS(OUTPUT("2020-10-07_Step3_LungTissue-CURATED.RDS"))
names(geo)
##  [1] "GSE27597"  "GSE106986" "GSE37768"  "GSE57148"  "GSE47460"  "GSE38974" 
##  [7] "GSE8581"   "GSE1650"   "GSE1122"   "GSE119040" "GSE103174" "GSE124180"

Differential expression analysis

Using DE() function (described above), we performed a lineal regression model to calculate the logarithm fold change of all genes between a “Control” and a “COPD” group. We also rename colnames adding the GSE ID at the end and finally, we save the output in a .CSV file.

GSE27597

i <- 1
de1 <- save_DE(i)
## plotting data

## DE analysis

## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv
## Warning: The `path` argument of `write_csv()` is deprecated as of readr 1.4.0.
## Please use the `file` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

GSE106986

i <- 2
de1 <- save_DE(i)
## plotting data

## DE analysis

## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv

GSE37768

i <- 3
de3 <- save_DE(i)
## plotting data

## DE analysis

## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv

GSE57148

This experiment is RNAseq, the data will be download from Recount2. I’m following Recount2 vignette Using DESeq2package we calculated DEG.

## Specify design and switch to DESeq2 format
i <- 4
rse <-  geo[[i]][[1]]

 dds <- DESeqDataSet(rse, ~ DISEASE)
## converting counts to integer mode
## Warning in DESeqDataSet(rse, ~DISEASE): some variables in design formula are
## characters, converting to factors
 ## Perform DE analysis
 dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 1439 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
 boxplot(log10(assays(dds)[["cooks"]]), range=0, las=2)

 ntd <- normTransform(dds)
 meanSdPlot(assay(ntd))

 res <- results(dds)
 plotMA(res, ylim=c(-2,2))

 # Calculates de CI
 res$error <- qnorm(0.975)*res$lfcSE
 res$CI.L <- res$log2FoldChange-res$error
 res$CI.R <- res$log2FoldChange+res$error
 
 res
## log2 fold change (MLE): DISEASE COPD vs CONTROL 
## Wald test p-value: DISEASE COPD vs CONTROL 
## DataFrame with 58037 rows and 9 columns
##                      baseMean log2FoldChange     lfcSE       stat      pvalue
##                     <numeric>      <numeric> <numeric>  <numeric>   <numeric>
## ENSG00000000003.14  924.46223     -0.1311028 0.0763894  -1.716244   0.0861174
## ENSG00000000005.5     2.85737     -0.4396692 0.4629495  -0.949713   0.3422580
## ENSG00000000419.12 1089.25569      0.0152275 0.0300546   0.506663   0.6123911
## ENSG00000000457.13  698.75730     -0.0656469 0.0405149  -1.620314   0.1051648
## ENSG00000000460.16  347.79272     -0.0502240 0.0465180  -1.079669   0.2802896
## ...                       ...            ...       ...        ...         ...
## ENSG00000283695.1   0.0454355     -0.1779953 1.9047251 -0.0934493 9.25547e-01
## ENSG00000283696.1  37.5449930      0.5120456 0.0933836  5.4832482 4.17586e-08
## ENSG00000283697.1  21.2156969      0.0782234 0.1087300  0.7194274 4.71878e-01
## ENSG00000283698.1   0.1327733     -0.1294404 0.9741244 -0.1328787 8.94289e-01
## ENSG00000283699.1   0.0646975      0.1592290 1.3843041  0.1150246 9.08426e-01
##                           padj     error       CI.L      CI.R
##                      <numeric> <numeric>  <numeric> <numeric>
## ENSG00000000003.14    0.177940 0.1497204 -0.2808232 0.0186176
## ENSG00000000005.5     0.502138 0.9073642 -1.3470334 0.4676951
## ENSG00000000419.12    0.743547 0.0589059 -0.0436783 0.0741334
## ENSG00000000457.13    0.207758 0.0794078 -0.1450547 0.0137609
## ENSG00000000460.16    0.435330 0.0911736 -0.1413976 0.0409495
## ...                        ...       ...        ...       ...
## ENSG00000283695.1           NA  3.733193  -3.911188  3.555197
## ENSG00000283696.1  5.80204e-07  0.183029   0.329017  0.695074
## ENSG00000283697.1  6.26366e-01  0.213107  -0.134884  0.291330
## ENSG00000283698.1           NA  1.909249  -2.038689  1.779808
## ENSG00000283699.1           NA  2.713186  -2.553957  2.872415
r <- as_tibble(res, rownames="rownames")
r <- full_join(r,as.data.frame(rowData(geo[[i]][[1]])), by=c("rownames"="gene_id")) 
r$logFC <- r$log2FoldChange

colnames(r) <- str_c(colnames(r),"_",names(geo[i]))
colnames(r)
##  [1] "rownames_GSE57148"       "baseMean_GSE57148"      
##  [3] "log2FoldChange_GSE57148" "lfcSE_GSE57148"         
##  [5] "stat_GSE57148"           "pvalue_GSE57148"        
##  [7] "padj_GSE57148"           "error_GSE57148"         
##  [9] "CI.L_GSE57148"           "CI.R_GSE57148"          
## [11] "bp_length_GSE57148"      "symbol_GSE57148"        
## [13] "GENE.SYMBOL_GSE57148"    "logFC_GSE57148"
r <- r[,-12]
colnames(r)
##  [1] "rownames_GSE57148"       "baseMean_GSE57148"      
##  [3] "log2FoldChange_GSE57148" "lfcSE_GSE57148"         
##  [5] "stat_GSE57148"           "pvalue_GSE57148"        
##  [7] "padj_GSE57148"           "error_GSE57148"         
##  [9] "CI.L_GSE57148"           "CI.R_GSE57148"          
## [11] "bp_length_GSE57148"      "GENE.SYMBOL_GSE57148"   
## [13] "logFC_GSE57148"
write_csv(r,path= OUTPUT(c("DE/",TODAY,"_Step4_TableGenes_",i,"_",names(geo[i]),".csv")))


de4 <- r

GSE47460

i <- 5
de5 <- save_DE(i)
## plotting data

## DE analysis

## [1] "(Intercept)"                      "DISEASECOPD"                     
## [3] "DISEASEInterstitial lung disease"
## Renaming colnames with GSE id
## Saving data into csv

GSE38974

i <- 6
de6 <- save_DE(i)
## plotting data

## DE analysis

## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv

GSE8581

i <- 7
de7 <- save_DE(i)
## plotting data

## DE analysis

## [1] "(Intercept)"         "DISEASECOPD"         "DISEASEUnclassified"
## Renaming colnames with GSE id
## Saving data into csv

GSE1650

i <- 8
de8 <- save_DE(i)
## plotting data

## DE analysis

## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv

GSE1122

i <- 9
de9 <- save_DE(i)
## plotting data

## DE analysis

## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv

GSE119040

i <- 10
de10 <- save_DE(i)
## plotting data

## DE analysis

## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv

GSE103174

i <- 11
de11 <- save_DE(i)
## plotting data

## DE analysis

## [1] "(Intercept)" "DISEASECOPD"
## Renaming colnames with GSE id
## Saving data into csv

GSE124180

i <- 12
rse <-  geo[[i]][[1]]

counts <- t(apply(exprs(rse), 1,as.numeric))
colnames(counts) <- colnames(rse)
colData <- pData(rse)

rse <- SummarizedExperiment(assays=list(counts=counts), colData=colData)
rowData(rse) <- fData(geo[[i]][[1]])

 dds <- DESeqDataSet(rse, ~ DISEASE)
## converting counts to integer mode
## Warning in DESeqDataSet(rse, ~DISEASE): some variables in design formula are
## characters, converting to factors
 ## Perform DE analysis
 dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 123 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
 boxplot(log10(assays(dds)[["cooks"]]), range=0, las=2)

 ntd <- normTransform(dds)
 meanSdPlot(assay(ntd))

 res <- results(dds)
 plotMA(res, ylim=c(-2,2))

 # Calculates de CI
 res$error <- qnorm(0.975)*res$lfcSE
 res$CI.L <- res$log2FoldChange-res$error
 res$CI.R <- res$log2FoldChange+res$error
 
 res
## log2 fold change (MLE): DISEASE COPD vs CONTROL 
## Wald test p-value: DISEASE COPD vs CONTROL 
## DataFrame with 54140 rows and 9 columns
##                  baseMean log2FoldChange     lfcSE      stat    pvalue
##                 <numeric>      <numeric> <numeric> <numeric> <numeric>
## ENSG00000000419  419.9005      0.1982886  0.158400  1.251821  0.210635
## ENSG00000000457  433.5534     -0.0283818  0.105238 -0.269691  0.787398
## ENSG00000000460  108.3336     -0.1826590  0.181100 -1.008609  0.313162
## ENSG00000000938   40.8155      0.5006423  0.492299  1.016947        NA
## ENSG00000000971 2305.8074     -0.0889485  0.186240 -0.477600  0.632935
## ...                   ...            ...       ...       ...       ...
## ENSG00000281904  0.283249      -0.441699  2.912079 -0.151678 0.8794407
## ENSG00000281909  0.360405      -0.580817  1.627405 -0.356898 0.7211683
## ENSG00000281910  0.000000             NA        NA        NA        NA
## ENSG00000281912  3.582084      -0.352601  0.477911 -0.737796 0.4606383
## ENSG00000281920  1.303847      -2.045900  1.059108 -1.931720 0.0533941
##                      padj     error      CI.L      CI.R
##                 <numeric> <numeric> <numeric> <numeric>
## ENSG00000000419  0.934453  0.310459 -0.112170  0.508747
## ENSG00000000457  0.981989  0.206263 -0.234645  0.177881
## ENSG00000000460  0.952026  0.354949 -0.537608  0.172290
## ENSG00000000938        NA  0.964889 -0.464247  1.465531
## ENSG00000000971  0.967615  0.365024 -0.453973  0.276076
## ...                   ...       ...       ...       ...
## ENSG00000281904        NA  5.707570  -6.14927 5.2658709
## ENSG00000281909        NA  3.189655  -3.77047 2.6088379
## ENSG00000281910        NA        NA        NA        NA
## ENSG00000281912  0.961404  0.936689  -1.28929 0.5840877
## ENSG00000281920        NA  2.075814  -4.12171 0.0299139
r <- as_tibble(res, rownames="rownames")
r <- full_join(r,as.data.frame(rowData(rse)), by=c("rownames"="GENEID")) 
r$logFC <- r$log2FoldChange

colnames(r) <- str_c(colnames(r),"_",names(geo[i]))
colnames(r)
##  [1] "rownames_GSE124180"       "baseMean_GSE124180"      
##  [3] "log2FoldChange_GSE124180" "lfcSE_GSE124180"         
##  [5] "stat_GSE124180"           "pvalue_GSE124180"        
##  [7] "padj_GSE124180"           "error_GSE124180"         
##  [9] "CI.L_GSE124180"           "CI.R_GSE124180"          
## [11] "GENENAME_GSE124180"       "SEQNAME_GSE124180"       
## [13] "GENESEQSTART_GSE124180"   "GENESEQEND_GSE124180"    
## [15] "GENE.SYMBOL_GSE124180"    "logFC_GSE124180"
write_csv(r,path= OUTPUT(c("DE/",TODAY,"_Step4_TableGenes_",i,"_",names(geo[i]),".csv")))

# saving the results to later combine all the information 
de12 <- r

Output

This script produces the following data, and can be found in /home/ana/DataDNA/R-projects/Meta-analysis_COPD

Tables with DE results: Tables with log fold change and p-values calculated per experiment and saved in csv files

Session Info

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] vsn_3.56.0                  oligo_1.52.1               
##  [3] Biostrings_2.56.0           XVector_0.28.0             
##  [5] oligoClasses_1.50.4         limma_3.44.3               
##  [7] DESeq2_1.28.1               SummarizedExperiment_1.18.2
##  [9] DelayedArray_0.14.1         matrixStats_0.57.0         
## [11] Biobase_2.48.0              GenomicRanges_1.40.0       
## [13] GenomeInfoDb_1.24.2         IRanges_2.22.2             
## [15] S4Vectors_0.26.1            BiocGenerics_0.34.0        
## [17] knitr_1.30                  forcats_0.5.0              
## [19] dplyr_1.0.2                 purrr_0.3.4                
## [21] readr_1.4.0                 tidyr_1.1.2                
## [23] tibble_3.0.3                ggplot2_3.3.2              
## [25] tidyverse_1.3.0             stringr_1.4.0              
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.4-1       ellipsis_0.3.1         rprojroot_1.3-2       
##  [4] fs_1.5.0               rstudioapi_0.11        hexbin_1.28.1         
##  [7] farver_2.0.3           affyio_1.58.0          bit64_4.0.5           
## [10] AnnotationDbi_1.50.3   fansi_0.4.1            lubridate_1.7.9       
## [13] xml2_1.3.2             codetools_0.2-16       splines_4.0.2         
## [16] geneplotter_1.66.0     jsonlite_1.7.1         broom_0.7.1           
## [19] annotate_1.66.0        dbplyr_1.4.4           BiocManager_1.30.10   
## [22] compiler_4.0.2         httr_1.4.2             backports_1.1.10      
## [25] assertthat_0.2.1       Matrix_1.2-18          cli_2.0.2             
## [28] htmltools_0.5.0        tools_4.0.2            gtable_0.3.0          
## [31] glue_1.4.2             GenomeInfoDbData_1.2.3 affy_1.66.0           
## [34] affxparser_1.60.0      Rcpp_1.0.5             cellranger_1.1.0      
## [37] vctrs_0.3.4            preprocessCore_1.50.0  iterators_1.0.12      
## [40] xfun_0.18              rvest_0.3.6            lifecycle_0.2.0       
## [43] XML_3.99-0.5           zlibbioc_1.34.0        scales_1.1.1          
## [46] BiocStyle_2.16.1       hms_0.5.3              RColorBrewer_1.1-2    
## [49] yaml_2.2.1             memoise_1.1.0          stringi_1.5.3         
## [52] RSQLite_2.2.1          highr_0.8              genefilter_1.70.0     
## [55] foreach_1.5.0          BiocParallel_1.22.0    rlang_0.4.7           
## [58] pkgconfig_2.0.3        bitops_1.0-6           evaluate_0.14         
## [61] lattice_0.20-41        labeling_0.3           bit_4.0.4             
## [64] tidyselect_1.1.0       here_0.1               magrittr_1.5          
## [67] R6_2.4.1               generics_0.0.2         DBI_1.1.0             
## [70] pillar_1.4.6           haven_2.3.1            withr_2.3.0           
## [73] survival_3.2-3         RCurl_1.98-1.2         modelr_0.1.8          
## [76] crayon_1.3.4           rmarkdown_2.3          locfit_1.5-9.4        
## [79] grid_4.0.2             readxl_1.3.1           blob_1.2.1            
## [82] webshot_0.5.2          reprex_0.3.0           digest_0.6.25         
## [85] xtable_1.8-4           ff_4.0.2               munsell_0.5.0         
## [88] viridisLite_0.3.0      kableExtra_1.2.1